Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus
Abstract
The DARPA BOLT Program develops systems capable of allowing English speakers to retrieve and understand information from informal foreign language sources. Phase 2 of the program required large volumes of naturally occurring informal text (SMS) and chat messages from individual users in multiple languages to support evaluation of machine translation systems. We describe the design and implementation of a robust collection system capable of capturing both live and archived SMS and chat conversations from willing participants. We also discuss the challenges of recruitment at a time when potential participants have acute and growing concerns about their personal privacy in the realm of digital communication, and we outline the techniques adopted to confront those challenges. Finally, we review the properties of the resulting BOLT Phase 2 Corpus, which comprises over 6.5 million words of naturally occurring chat and SMS in English, Chinese and Egyptian Arabic.
Similar resources
Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus
This paper describes the process of creating a novel resource, a parallel Arabizi-Arabic script corpus of SMS/Chat data. The language used in social media expresses many differences from other written genres: its vocabulary is informal with intentional deviations from standard orthography such as repeated letters for emphasis; typos and nonstandard abbreviations are common; and nonlinguistic co...
Language and the Socio-Cultural Worlds of Those Who Use it: A Case of Vague Expressions
The present study is an attempt to investigate the use of vague expressions by intermediate EFL learners. More specifically, the current study focuses on the structures and functions of one of the most common categories of vague language, i.e. general extenders. The data include a 22-hour corpus of English-as-a-foreign-language conversations. A comparison is also made between this corpus and a...
You Talking to Me? A Corpus and Algorithm for Conversation Disentanglement
When multiple conversations occur simultaneously, a listener must decide which conversation each utterance is part of in order to interpret and respond to it appropriately. We refer to this task as disentanglement. We present a corpus of Internet Relay Chat (IRC) dialogue in which the various conversations have been manually disentangled, and evaluate annotator reliability. This is, to our know...
MPC: A Multi-Party Chat Corpus for Modeling Social Phenomena in Discourse
In this paper, we describe our experience with collecting and creating an annotated corpus of multi-party online conversations in a chat-room environment. This effort is part of a larger project to develop computational models of social phenomena such as agenda control, influence, and leadership in on-line interactions. Such models will help capture the dialogue dynamics that are essential fo...
The Query of Everything: Developing Open-Domain, Natural-Language Queries for BOLT Information Retrieval
The DARPA BOLT Information Retrieval evaluations target open-domain natural-language queries over a large corpus of informal text in English, Chinese and Egyptian Arabic. We outline the goals of BOLT IR, comparing it with the prior GALE Distillation task. After discussing the properties of the BOLT IR corpus, we provide a detailed description of the query creation process, contrasting the summa...